Analyzing Portuguese “Vinho Verde” Red Wine Quality

by Bhavin V. Choksi


This project analyzes the physicochemical properties that affect the quality of 1599 variants of the Portuguese “Vinho Verde” red wine.

The physicochemical properties in the data set are based on objective tests.

The quality of each wine is graded from 0 (very bad) to 10 (very excellent), based on the median of at least 3 evaluations by wine experts.

My objective is to determine which of the physicochemical properties affect wine quality, and then build a linear model based on those factors to predict quality.

Property | Unit | Description
Fixed acidity | gm/L | Most acids involved with wine are fixed or nonvolatile (do not evaporate readily).
Volatile acidity | gm/L | The amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste.
Citric acid | gm/L | Found in small quantities, citric acid can add ‘freshness’ and flavor to wines.
Residual sugar | gm/L | The amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet.
Chlorides | gm/L | The amount of salt in the wine.
Free sulfur dioxide | mg/L | The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.
Total sulfur dioxide | mg/L | Amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
Density | gm/mL | The density of wine is close to that of water depending on the percent alcohol and sugar content.
pH | - | Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
Sulphates | gm/L | A wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.
Alcohol | % | The percent alcohol content of the wine.

Wine Quality Data Set Information


Data

Confirming that all 1599 rows were loaded.

## [1] 1599

Confirming that all columns were loaded.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

Sample data shows that quality has discrete numeric values, and all physicochemical properties have continuous numeric values.

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
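
The three checks above can be sketched as follows. This is a minimal illustration rather than the original loading code: it reads the six sample rows shown above from an inline string, whereas the full analysis would read all 1599 rows from the data file (its name is not given here). The data frame name `wqr` matches the one used later in this report.

```r
# Assumption: the six sample rows shown above stand in for the full CSV file.
csv_text <- "X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
1,7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
2,7.8,0.88,0.00,2.6,0.098,25,67,0.9968,3.20,0.68,9.8,5
3,7.8,0.76,0.04,2.3,0.092,15,54,0.9970,3.26,0.65,9.8,5
4,11.2,0.28,0.56,1.9,0.075,17,60,0.9980,3.16,0.58,9.8,6
5,7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
6,7.4,0.66,0.00,1.8,0.075,13,40,0.9978,3.51,0.56,9.4,5"

wqr <- read.csv(text = csv_text)

nrow(wqr)    # number of rows loaded (1599 for the full data set)
names(wqr)   # column names
head(wqr)    # first rows of the data
str(wqr)     # column types: quality is integer, the properties are numeric
```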

Univariate Plots and Analysis

Wine Quality

A summary and plot of quality shows that wines in the data set have grades between 3 and 8. None of the wines are close to being very bad or very excellent.

For the purpose of analysis, I have categorized grades 3 and 4 as Low, grades 5 and 6 as Medium, and grades 7 and 8 as High.

About 4% of the wines are of Low quality, 82.5% are of Medium quality, and 13.6% are of High quality.

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##    Low Medium   High 
##     63   1319    217
## 
##        Low     Medium       High 
## 0.03939962 0.82489056 0.13570982
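
One way to produce this grouping is sketched below. It rebuilds the quality column from the grade counts printed above instead of using the data file, and the variable name `quality.category` is hypothetical; the original code may have done this differently.

```r
# Rebuild the quality column from the grade counts reported above.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Grades 3-4 -> Low, 5-6 -> Medium, 7-8 -> High.
# cut() uses right-closed intervals by default: (2,4], (4,6], (6,8].
quality.category <- cut(quality,
                        breaks = c(2, 4, 6, 8),
                        labels = c("Low", "Medium", "High"))

table(quality.category)                        # counts per category
round(prop.table(table(quality.category)), 3)  # share of wines per category
```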

Grades 3 and 8 are statistical outliers for the provided data. However, they belong in the data set for further analysis: each grade is based on at least three evaluations, so these values are unlikely to be errors.

## [1] 3 8
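
The outlier grades can be found with the usual box plot rule, sketched here under the assumption that the analysis used something similar. The quality column is again rebuilt from the grade counts above, and `outlier.grades` is a hypothetical name.

```r
# Rebuild the quality column from the grade counts reported earlier.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# boxplot.stats() flags values more than 1.5 * IQR beyond the hinges.
# Here the hinges are 5 and 6, so anything below 3.5 or above 7.5 is flagged.
outlier.grades <- sort(unique(boxplot.stats(quality)$out))
outlier.grades
```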

Fixed Acidity

The values look roughly normally distributed with a few outliers at the high end, with most wines here having fixed acidity in the range of 5 to 11 gm/L.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Volatile Acidity

Volatile acidity also seems to be normally distributed, with most wines here having acetic acid in the range of 0.2 to 0.9 gm/L.

The mean and median are close, 0.53 and 0.52 respectively.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Citric Acid

Over 8% of the wines here (132 of 1599) have no citric acid. Most of the rest have less than 0.8 gm/L, and the maximum is 1.0 gm/L.

Mean and median are close, 0.27 and 0.26 respectively.

## 
## FALSE  TRUE 
##  1467   132
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Residual Sugar

Most wines here have between 1 and 3 gm/L of residual sugar, with some outliers having up to 15.5 gm/L.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Chlorides

Most wines here have 0.04 to 0.12 gm/L of salt, with a few outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Free and Total Sulfur Dioxide

The distribution of values is long tailed, with most wines here having free sulfur dioxide in the range of 3 to 40 mg/L, and total sulfur dioxide in the range of 6 to 150 mg/L.

Free sulfur dioxide becomes evident in the nose and taste of a wine only at concentrations over 50 ppm (mg/L). Just 1% (16 of 1599) of the wines here have free sulfur dioxide concentrations above 50 ppm.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00
## 
## FALSE  TRUE 
##  1583    16

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Density

Wine density looks normally distributed, in a narrow range between 0.990 and 1.004 gm/mL.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

pH

pH levels look normally distributed, mainly in the range of 3.0 to 3.6.

The mean and median are nearly identical at 3.31.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Sulphates

Sulphates content is mostly between 0.45 and 0.9 gm/L. There are some outliers on the higher end of the value range.

The mean and median are close, 0.66 and 0.62 respectively.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Alcohol

Most wines here have an alcohol content between 9 and 13%.

Mean and median are a little over 10%.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90


Bivariate Plots and Analysis

Analyzing Physicochemical Properties vs. Quality

The following box plots chart each property against quality. The patterns on the charts indicate a linear relation between quality and some of the properties, such as:

  • volatile acidity
  • citric acid
  • density
  • sulphates
  • alcohol

Exploring Conventional Wisdom

Citric Acid and Quality

The data set description states that “citric acid can add ‘freshness’ and flavor to wines”. The box plot does indicate that, but I want to explore it further.

About 8% of the wines here (132 of 1599) have no citric acid. I want to compare the distribution of grades for wines with and without citric acid.

The charts below show that only 6% of wines with no citric acid (8 of 132) are of High quality (grade 7 or better).

Over 14% of wines with citric acid (209 of 1467) are of High quality.

The data analysis seems to confirm that the presence of citric acid does influence wine quality positively.

## 
##   3   4   5   6   7   8 
##   7  43 624 584 191  18
## 
##  3  4  5  6  7 
##  3 10 57 54  8

Free Sulfur Dioxide and Quality

The data set description states that “at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine”. Only 1% of the wines here (16 of 1599) meet that criterion, but I want to explore if free SO2 could impact wine quality.

About 13.6% of the wines here (215 of 1581) with free SO2 concentrations below 50 ppm are of High quality (grade 7 or better).

On the other hand, only 12.5% of the wines here (2 of 16) with free SO2 concentrations over 50 ppm are of High quality.

Although this finding contradicts conventional wisdom, it is worth noting that no wine here with over 50 ppm free SO2 concentration is of Low quality (grade 4 or lower).

## 
##   3   4   5   6   7   8 
##  10  53 671 632 197  18
## 
## 5 6 7 
## 9 5 2

Statistical Measures

The correlation coefficients confirm that volatile acidity, sulphates and alcohol influence wine quality to an extent that could be significant, and citric acid and density to a lesser degree.

##                property           r    min     max
## 1         fixed.acidity  0.12405165 4.6000  15.900
## 2      volatile.acidity -0.39055778 0.1200   1.580
## 3           citric.acid  0.22637251 0.0000   1.000
## 4        residual.sugar  0.01373164 0.9000  15.500
## 5             chlorides -0.12890656 0.0120   0.611
## 6   free.sulfur.dioxide -0.05065606 1.0000  72.000
## 7  total.sulfur.dioxide -0.18510029 6.0000 289.000
## 8               density -0.17491923 0.9901   1.004
## 9                    pH -0.05773139 2.7400   4.010
## 10            sulphates  0.25139708 0.3300   2.000
## 11              alcohol  0.47616632 8.4000  14.900
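
A table like the one above can be built by correlating each property with quality in turn. The sketch below is one possible way to do it; `correlate_with` is a hypothetical helper, and the data frame `toy` is synthetic stand-in data, not the wine data set. For the wine data, the row-index column `X` would need to be dropped first.

```r
# Hypothetical helper: Pearson correlation of every other column with `target`,
# plus the range of each column.
correlate_with <- function(df, target = "quality") {
  predictors <- setdiff(names(df), target)
  data.frame(
    property = predictors,
    r   = sapply(predictors, function(p) cor(df[[p]], df[[target]])),
    min = sapply(predictors, function(p) min(df[[p]])),
    max = sapply(predictors, function(p) max(df[[p]])),
    row.names = NULL
  )
}

# Synthetic stand-in data, NOT the wine data set.
set.seed(1)
toy <- data.frame(alcohol = runif(50, 8, 15))
toy$volatile.acidity <- runif(50, 0.1, 1.6)
toy$quality <- 0.5 * toy$alcohol - 1.5 * toy$volatile.acidity +
  rnorm(50, sd = 0.5)

correlate_with(toy)
```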

Analyzing Relationship Between Physicochemical Properties

Citric Acid and Fixed Acidity

Citric acid, being non-volatile, has a positive linear relationship with fixed acidity.

From the scales it seems citric acid is a small part of overall fixed acidity.

Fixed acidity in wine.

Citric Acid and Volatile Acidity

Acetic acid, the volatile acid in wine, seems to be lower in wines when citric acid is higher.

Fixed Acidity and Density

There is a positive linear relationship between fixed acidity and density.

This is to be expected as the fixed acids found in wine are denser than water (1 gm/mL).

Fixed Acid in Wine | Density
Tartaric Acid | 1.79 gm/mL
Malic Acid | 1.61 gm/mL
Citric Acid | 1.67 gm/mL
Succinic Acid | 1.56 gm/mL

Acidity and pH

As expected, pH levels have an inverse linear relationship with fixed acidity. The lower the pH level, the more acidic the solution.

Residual Sugar and Chlorides, and Density

The higher the residual sugar and salt in a wine, the denser it seems to be.

Measuring wine density is a method for categorizing it as Dry, Medium-sweet or Sweet.

Wine fermentation.

Density and Alcohol

The plot shows an inverse linear relationship between alcohol and density: the higher the alcohol content, the lower the density.

The fermentation process converts sugars in grape juice to ethanol (ethyl alcohol). Density of ethanol is 0.789 gm/mL, which is lower than that of water (1 gm/mL).

Wine fermentation.
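
The direction of this relationship can be illustrated with a simple mixing calculation. This is an idealized sketch, not a model of real wine: it assumes the alcohol percentage is by volume, treats wine as a mixture of water and ethanol only, and ignores sugar, acids and the volume contraction that occurs when the two liquids mix. `mix_density` is a hypothetical helper.

```r
# Idealized density (gm/mL) of a water-ethanol mixture.
# abv: alcohol content in percent, assumed to be by volume.
mix_density <- function(abv, water = 1.0, ethanol = 0.789) {
  fraction <- abv / 100
  (1 - fraction) * water + fraction * ethanol
}

# Minimum, median and maximum alcohol content in the data set.
mix_density(c(8.4, 10.2, 14.9))
```

The idealized values come out lower than the densities observed in the data set (0.990 to 1.004 gm/mL), which is expected because dissolved sugar, acids and salts raise the density of real wine.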


Multivariate Plots and Analysis

Density Estimates by Quality

These density charts indicate that there is a higher probability that wines with better grades have higher alcohol, sulphates and citric acid content but lower volatile acidity.

Histograms by Quality

The three histograms below confirm that wines of lower quality tend to have lower alcohol and sulphates content but higher volatile acidity, and vice versa.

On the following histograms, wines of all qualities seem to be spread across the range of values of citric acid and density, indicating a weaker correlation between those properties and quality.


Final Plots and Summary

Plots, statistical measures and analysis thus far seem to indicate that, of the eleven physicochemical properties, alcohol, volatile acidity and sulphates have the strongest correlations with wine quality.

The following plots provide further confirmation that these three properties can be significant in predicting wine quality.

This plot indicates that wines of High quality tend to have higher alcohol content and lower volatile acidity.

This scatter plot indicates that Low and Medium quality wines are concentrated at points where sulphates content is lower and volatile acidity is higher.

This chart shows that as alcohol and sulphates content increase, wine quality tends to improve.

Interpreting the Charts and Estimated Coefficients

Property | Linear Relation | Comments
Fixed acidity | Weak, positive. | -
Volatile acidity | Medium to strong, negative. | Acetic acid at high levels can lead to an unpleasant, vinegar taste.
Citric acid | Weak to medium, positive. | Citric acid can add ‘freshness’ and flavor to wines.
Residual sugar | Very weak, positive. | All wines in the data set are fairly dry. The highest residual sugar level is 15.50 gm/L. Only wines over 45 gm/L are considered sweet.
Chlorides | Weak, negative. | -
Free sulfur dioxide | Very weak, negative. | -
Total sulfur dioxide | Weak, negative. | -
Density | Weak, negative. | Density of water is 0.99997 gm/mL. All wines in the data set are close to the density of water (0.99 to 1.004).
pH | Very weak, negative. | On a scale of 0 (very acidic) to 14 (very basic) most wines are between 3-4. All wines in the data set are between 2.74 and 4.01.
Sulphates | Medium, positive. | Sulphates contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. Interestingly, free and total sulfur dioxide levels do not seem to impact quality.
Alcohol | Strong, positive. | -

Building a model

Let’s build a linear model to predict wine quality using the following properties that have a medium to strong linear relation with quality as predictor variables:

  • alcohol
  • volatile acidity
  • sulphates

We need two distinct samples from red wine quality data. One sample will be used to train the linear model. The other sample will be used to test the model, and compare its results with the actual evaluation by wine experts.

# Seed the random number generator so the split is reproducible
set.seed(1056)

# 1,500 rows in the training set, the remaining 99 rows in the test set
sample.indices <- sample(seq_len(nrow(wqr)), 1500)

training <- wqr[sample.indices, ]
test <- wqr[-sample.indices, ]

# Nested linear models, adding one predictor at a time
m1 <- lm(quality ~ alcohol, data = training)
m2 <- update(m1, ~ . + volatile.acidity)
m3 <- update(m2, ~ . + sulphates)

The summary below indicates that all three predictors contribute to the model:

  • The R^2 value indicates that the three properties (alcohol, volatile acidity and sulphates) together explain about 33% of the variation in wine quality.

  • The R^2 value of the model with only alcohol as a predictor variable indicates that alcohol alone explains about 21% of the variation in wine quality.

  • Three significance stars (***) next to each coefficient indicate that, if no relationship existed between that property and wine quality, an estimate this far from zero would be very unlikely.

  • The p row reports the p-value of the overall F-test for each model. A value of 0.000 indicates a very low probability of seeing such a fit if none of the predictors were related to wine quality.

library(memisc)  # provides mtable()
mtable(m1, m2, m3)
## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = training)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = training)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates, 
##     data = training)
## 
## ===============================================
##                      m1        m2        m3    
## -----------------------------------------------
## (Intercept)        2.006***  3.281***  2.817***
##                   (0.181)   (0.189)   (0.201)  
## alcohol            0.348***  0.300***  0.296***
##                   (0.017)   (0.016)   (0.016)  
## volatile.acidity            -1.465*** -1.303***
##                             (0.097)   (0.100)  
## sulphates                              0.646***
##                                       (0.104)  
## -----------------------------------------------
## R-squared             0.213     0.316     0.334
## adj. R-squared        0.212     0.315     0.332
## sigma                 0.712     0.663     0.655
## F                   404.764   346.213   249.559
## p                     0.000     0.000     0.000
## Log-likelihood    -1616.849 -1511.100 -1491.906
## Deviance            758.348   658.617   641.977
## AIC                3239.698  3030.199  2993.813
## BIC                3255.638  3051.452  3020.379
## N                  1500      1500      1500    
## ===============================================
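
To make the coefficients concrete, the sketch below applies the m3 equation by hand to the first wine in the sample data (alcohol 9.4, volatile acidity 0.70, sulphates 0.56). It uses the rounded coefficients printed in the table, so the result differs slightly from what `predict()` would return; `coef.m3` and `wine` are hypothetical names.

```r
# Coefficients of model m3, rounded as printed in the table above.
coef.m3 <- c(intercept = 2.817, alcohol = 0.296,
             volatile.acidity = -1.303, sulphates = 0.646)

# First wine in the sample data shown earlier.
wine <- c(alcohol = 9.4, volatile.acidity = 0.70, sulphates = 0.56)

# quality = intercept + sum of (coefficient * property value)
predicted <- coef.m3[["intercept"]] + sum(coef.m3[names(wine)] * wine)
predicted  # about 5.05; the experts graded this wine 5
```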

Testing the model

A test of the model results in residuals that are fairly normally distributed, which is consistent with the assumptions of a linear model.

# Predict quality for the test set, with 95% prediction intervals
estimate <- predict(m3, newdata = test, interval = "prediction", level = 0.95)
estimate <- data.frame(estimate)

# Attach the experts' grades and compute residuals (predicted minus actual)
estimate$actual.quality <- test$quality
estimate$residual <- estimate$fit - estimate$actual.quality
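
Beyond plotting the residuals, a few summary numbers can describe how close the predictions are. The sketch below is one possible approach, not part of the original analysis: `summarize_predictions` is a hypothetical helper that expects the columns `fit`, `lwr`, `upr`, `actual.quality` and `residual` created above, and `toy` is made-up stand-in data, not real results.

```r
# Hypothetical helper: summarize how well predictions match the actual grades.
summarize_predictions <- function(estimate) {
  c(rmse = sqrt(mean(estimate$residual^2)),
    mean.abs.error = mean(abs(estimate$residual)),
    # share of actual grades inside the 95% prediction interval
    share.in.interval = mean(estimate$actual.quality >= estimate$lwr &
                             estimate$actual.quality <= estimate$upr),
    # share of predictions that round to the experts' grade
    share.exact = mean(round(estimate$fit) == estimate$actual.quality))
}

# Toy stand-in for the `estimate` data frame, NOT real results.
toy <- data.frame(fit = c(5.2, 5.8, 6.6),
                  lwr = c(3.9, 4.5, 5.3),
                  upr = c(6.5, 7.1, 7.9),
                  actual.quality = c(5, 6, 8))
toy$residual <- toy$fit - toy$actual.quality

summarize_predictions(toy)
```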


Reflection

I started out with an analysis of each individual data element to get an idea of the nature and distribution of its values. Univariate analysis indicated that most wines are of Medium quality, and none of the wines have extremely low or high ratings. Certain properties, such as citric acid and free sulfur dioxide, that can positively impact quality were not found to be prevalent at high rates or levels in the wines.

The analysis then progressed to test the impact of each physicochemical property on quality. Alcohol, volatile acidity, sulphates, citric acid and density were found to have a linear relationship with quality.

Relationships between pairs of properties were also analyzed, revealing correlations between density and alcohol, density and residual sugar, and fixed acidity and citric acid, among others. Many of those relationships could be explained by their physical and chemical attributes.

Multivariate analysis on alcohol, volatile acidity, sulphates, citric acid and density revealed that citric acid and density did not have as strong a linear relationship with wine quality as the other three properties. The correlation coefficients confirmed that finding.

A linear model was built using alcohol, volatile acidity and sulphates as predictor variables. The table of estimates and a plot of the residuals indicated that the model was a good fit.

However, the model explains only 33% of the variation in wine quality. It seems natural that more than just 3 of 11 physicochemical properties of wine should determine quality. A larger data set with a greater range of values for certain properties such as citric acid and free sulfur dioxide may allow us to use more predictor variables and build a better model.